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Abstract 

We introduce the idea that resampling from past observations in a Markov Chain Monte 
Carlo sampler can fasten convergence. We prove that proper resampling from the past 

H 

^yy • does not disturb the limit distribution of the algorithm. We illustrate the method with two 



examples. The first on a Bayesian analysis of stochastic volatility models and the other on 
Bayesian phylogeny reconstruction. 
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Markov Chain Monte Carlo (MCMC) methods have become the standard computa- 
tional tool for bayesian inference. But the great flexibility of the method comes with 
a price. Namely, it is very difficult to determine a priori (before the simulation) or 
a posteriori whether a given MCMC sampler can mix or has mixed in a given com- 
puting time. The challenge becomes that of designing fast converging Monte Carlo 
algorithms. Contributions in this field can have significant impact in other scientific 

disciplines where these methods are used. 
1 This work is funded in part by NSERC Canada 

2 Department of Mathematics and Statistics, University of Ottawa, email: yatchade@uottawa.ca 
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In this paper, we propose a new and general approach to increase the conver- 
gence rate of MCMC algorithms. The method is based on resampling. Suppose that 
at time n, we want to sample X n in a MCMC algorithm. Instead of sampling X n 
from P(X n -i, •) for some transition kernel P, we propose to obtain X n by resam- 
pling independently from {Xb, • • • , X n _i}, where B > is some burn-in period. This 
resampling from the past step is then repeated during the simulation at some prede- 
termined times di < a 2 < . . .. Basically, the idea is to look at {X B , . . . ,X n _i} as a 
sample from n. Therefore resampling from the past allows the sampler to move more 
easily and according to a distribution that is close to ir. The resampling schedule 
plays an important role. As long as we do not resample too much (typically, we need 
(a n ) such that a n /n — > oo as n — > oo), we show that resampling from the past does 
not disturb the limit distribution of the sampler. 

Resampling from the past can perform poorly if the original sampler has a very 
poor convergence rate. We extend the framework above by allowing resampling from 
an auxiliary process that has a better convergence rate towards its target 

distribution n^°\ Resampling from an auxiliary proces s is not new and is the idea 



200fih . But the 



behind the equi-energy sampler recently proposed by (|Kou et al. 
equi-energy sampler has a number of complications that we avoid here by using an 
importance-r esampling. The idea is also apparent i n the "Metropolis with an adaptive 



proposal" of (jChanvean and Vandekerkhovf 



20011 ) . On the theoretical side, we show 



in the case of importance-resampling, that resampling from an auxiliary process does 
not disturb the limit distribution of the sampler. 

We apply our methods to two examples from Bayesian data analysis. First, we 
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consider the Bayesian analysis of stochastic volatility models (IKim et al 



1998). We 



improve the efficiency of the basic Gibbs sampler for this problem by a factor of fifty 
(50). In the second example, we look at Bayesian phylogenetic t rees reconstruction . 



Our methods improve the efficiency of the MCMC sampler of (jLarget and Simon 



19991 ) by a factor of hundred (100). 

The paper is organized as follows. In Sectional we present the idea of resampling 
from the past. Resampling from an auxiliary process is discussed in Section 03 All 
the theoretical proofs are postponed to Section 03 and the simulation examples are 



presented in Section HJ 



2 Resampling from the past 

Let {X n } be a Markov chain with state space (X, £>), transition kernel P and invariant 
distribution n started at X Q = x. If the chain is ergodic then C x (X n ), the distribution 
of X n , will converge to ir as n — >• oo. But it is well known that for MCMC algorithms, 
the convergence of £ x (X n ) to n can be too slow for the sampler to be useful. We 
propose the following idea to accelerate the convergence of Markov chains. Suppose 
that after a burn-in period B, we have the sample {Xb, Xb+i, ■ ■ ■ ,X n _i} at time n. 
Instead of sampling X n ~ P(X n _i, •) as we normally do, we obtain X n by resampling 
independently and with equal weight from {Xb, Xb+i, ■ ■ ■ , X n _i}. The resampling 
step is then repeated at some predetermined times a\ < < ■ ■ .. Intuitively, if P 
mixes reasonably well, {Xb, Xb+i, ■ ■ ■ , X n _i} can be seen as a sample points from 7r 
and resampling will operate as an i.i.d. sampling from n. 
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Consider the following toy example. We want to use the Random Walk Metropolis 
(RWM) algorithm with proposal density q(x,y) = Af(y — x;0,a 2 ) with a = 0.1 
to sample from the standard normal density M(x;0, 1); where M{x\ /i, a 2 ) denotes 
the density of the normal distribution iV(/i, a 2 ) with mean /i and variance a 2 . We 
compare the plain RWM with a RWM with resampling. Each sampler is run for 
25,000 iterations. Graph 1 (a) shows the last 5,000 sample points and Graph (b), 
the autocorrelation function from the last 20, 000 points in the plain RWM sampler. 
For the RWM with resampling, we resample at times B + \k a ] (see the justification 
below), with B = 5,000 and a = 1.3. Graph 1 (c) and (d) show the corresponding 
results for the RWM with resampling. As we can see, there is a significant gain in 
efficiency. 

Intuitively, resampling helps to the extend that P mixes rapidly. Differently put, 
the slower P converges to ir, the longer we should wait between two resampling. What 
should be the resampling schedule (a^)? Obviously, we should not resample all the 
time. We find that the choice = b\ + b 2 k a , a > 1 is a valid choice and works well 
in practice for b 2 = 1, and a ~ 1.3. The choice = bi + b 2 k is also theoretically 
valid as long as b 2 , the time between two resampling, is large enough. 
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Graph 1: Comparing a plain RWM and a RWM with resampling in sampling from 
the standard norma distribution N(0, 1). 

2.1 Theoretical discussion 

What can we prove about this algorithm? We can prove that despite the resampling, 
the limit distribution of the algorithm is n under certain conditions on P and on 
the resampling schedule (a^). We recall the algorithm. The resampling schedule 
< cii < 0,2 < ■ ■ • < a n < oo is given and is nonrandom. Fix B the burn-in 
period. We start the sampler at some arbitrary point X = x. At time n > 1, given 
{X , . . . ,X„_i}, if n > B and n = a k for some k > 1 then X n ~ Z^"=s^(")- 
Otherwise sample X n ~ P(X n _i, •). We denote Pr the underlying probablity measure 
and E its expectation operator. Here are some standard notations that we use below. 
If Pi and P2 are two transition kernels on X, the product P\Pi denotes the transition 
kernel P\P2(x,A) := J P±(x , dy) P^{y , A) . Recursively, we can define by Pi = P\ 
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and P™ = P™~ P±. A transition kernel Pi defines a linear operator (also denoted Pi) 
on the space of M-valued functions on (X, B) into itself, by P\f{x) := f Pi(x, dy)f(y). 
If /i is a signed measure on (X,B), we denote fi(f) := J fi(dx)f(x) and we will also 
write fi to denote the linear functional on the space of M-valued functions on (X, B) 
thus induced. Finally, we define pPi(A) := J p(dx)P\(x, A). Let V : X — >• [1, oo) 
be given. For / : — > R, we define its V-norm \f\ v := sup xg ^ y^x and we 
introduce the space L v :={/: A" — ► M : \f\ v < 00} . For a signed measure /i 
on we define its V^-norm \\n\\ v :— supj gLv |/| <i IM/)I- Similarly, for a linear 

operator T from the space of M-valued functions on X into itself, we define |||T||| y : = 
supj GLv |/| <i \Tf\ v . If IT ly < 00, then T defines a bounded linear operator from 
the Banach space (Ly, \-\ v ) into itself. 

We assume that the transition kernel P in the algorithm is geometrically ergodic 
in the sense that: 

Assumption (A): P is irreducible, aperiodic and there exists p G (0,1), a measurable 
function V : X — > [1, 00) such that 

|||P--7r|||y = 0(p"), (1) 



This assumption implies that Tr(V) < and that suv„ P n V"(x) < 00 for any 



x G X, a G [0, 1]. We refer the reader to (jMevn and Tweedie 



1993) for more on 



geometrically ergodic Markov chains. This is a convenient assumption that is known 
to hold for many MCMC sampler. 

Define c := ^- and 5 n := -a x log(p) + Yl=2 lo g( a fc) ~ lo g( c + a k-i)- 
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Theorem 2.1. Assume (A). Then there exists a constant C G (0, oo) such that for 
a k <n < a k +i : 

|||£ (n) -7r||| v <Cp"- fe expH fe ], (2) 

where the transition kernel is defined by C^ n \x,A) := Pr [X n G A\X = x]. In 
particular if d~ n — > oo as n — > oo, the algorithm has limit distribution ix. 

Proof. See Section ©. □ 

Resampling from the past can sensibly reduce the autocorrelation in the output 
of a MCMC algorithm. But when the sampler has a very slow mixing time, it might 
be better to resample from an auxiliary process that has a better mixing time. 



3 Resampling from an auxiliary process 

As above, ir(dx) oc h(x)X(dx) is the probability measure of interest on the measure 
space (X, £>, A). We introduce another probability measure n^^dx) oc h^(x)\(dx) on 
(X, B, A). Let {Xn^} be a Markov chain with invariant distribution ir^ and transition 
kernel PW. Let k : X x X -> [0, oo) be a measurable function and T a transition ker- 
nel on (X_, B). Define the transition kernel Q(x,dy) = ^ ^ j ^o)^z)k\x^z) dy ^ • Following 



( Tierney 



1998T ). let S C X x X be such that the probability measures ir(dx)Q(x, dy) 
and ir(dy)Q(y, dx) are mutually absolutely continuous on S and mutually singular on 
X\S. 

We assume that {Xi 0) } converges (reasonably quickly) to 7r^ '. Let P be a tran- 
sition kernel with invariant distribution n and 6 G [0,1]. The algorithm works as 
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"(o) v (o) 



follows. Given (A^ u; , . . . , AT\ X Q , . . . , X n ): 

• with probability 6, we sample X n+ i from P(X n , 



with probability 1—6, we propose F from R n (X n , •) where R n (x, A) = — ' =Q n ' ' — 

z2i=o H x >x l 

In other words, we resample Y\ from {Xq°\ . . . ,Xn^} with weights k(X n , xf^) 
and propose Y ~ •). 

Then we either "accept" Y and set X n+ i = Y with probability a(X n ,Y), or 
"reject" F and set X n+1 = X n with probability 1 — a(X n , Y), where 



mm 



■x{dy)Q(y,dx) 
Tr(dx)Q(x,dy) 



if (x, y) E S 



a(x,y) = { (3) 
otherwise. 

For n large enough, a sample from R n (x, ■) can be seen as a sample from Q(x, dy) 
which explain the acceptance probability (JHJ). But the algorithm is not feasible as 
such because the ratio in (jHJ cannot be computed in general. The natural choice 
which simplifies Q is to choose a transition kernel T that is invariant under n and 
k(x, y) = uj(y) = h(y) j hS Q > (y) . With this choice, we get a(x, y) = 1 on S. We call this 
scheme importance- sampling resampling. It is not necessary to choose a complicated 
transition kernel for T. Throughout, we choose T to be the identity transition kernel, 
T(x, A) = 1a( x ) m which case S = {(x, y) : < h(x)k(x,y) < oo}. 

Another choice for which the acceptance ratio a{x, y) simplifies is T(x, A) = 1a{x) 
and k(x, y) = l{D(x)}(y) where (DA is a given partition of X and D(x) = Djjfx G Z?j. 



2nnd). with 



This corresponds to the set-up of the equi-energy sampler of (|Kou et al. 
this choice of k, the acceptance probability becomes a(x,y) = min ^1, ^||yj (and 
if uj(x) = or uj(x) = oo). The drawback with this choice is that we have to define 
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the partition {D,i) in the first place and an inadequate partition can result in a high 
rejection rate for the resampling step. 



Algorithm 3.1 (MCMC with Importance- Resampling from an auxiliary 
process). At some time n> 1, given (Xq*\ . . . , Xn\ X , . . . , X^j : 

(i) With probability 6, sample X n+ \ from P(X n , •). Otherwise with probability 1 — 6 

sample X n+ i from 

EILcM^ (0) )V>(-) 

i 

EILo^ (0) ) 

(ii) Sample X^ from P^(xL°\ ■). 



3.1 Theoretical discussion 



We look more closely 



( Atchade and Liu . 



to {X n } when the importance-resampling scheme is used. 



20061 ) have shown that the limit distribution of the equi-energy 



sampler is indeed n under a number of conditions. We can study the process |X„ j 



along the same line. The assumption we impose are less stronger than in (jAtchade and Liu . 



2006). We continue with the notations in Section 12711 Essentially we will assume that 
p(o) j s geometrically ergodic and that the weight function satisfies \u\ Va < oo, for 
some a G [0, 1/4). Typically uj is bounded. 



Assumption (AO): is irreducible and aperiodic and there exists po G (0, 1) such 
that 



\p(or 



7T 



(0)1 



0(p 



(4) 
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where V is as in (A). 

Theorem 3.1. Assume that P satisfies (A), P^ satisfies (AO) and \co\ va < oo 
for some a G [0,1/4). Then for any measurable function f : X —> R such that 

\f\v < °°> 

E[f(X n )\X = x}—+TT(f), asn^oo (5) 

and 

^ n—l 

-tt(/))-^0, asn^oo. (6) 
Proof. See Section El □ 



4 simulation examples 



We illustrate the methods developed above with two examples from bayesian mod- 



elling. In the first examp 



models ( if Kim et al 



e, we consider the Bayesian analysis of stochastic volatility 



1998)) and in the second examp le, we look at Bayesian phyloge- 



netic trees reconstruction ( ifLarget and Simon 



1999)). 



4.1 Bayesian analysis of stochastic volatility models 



We consider the Bayesian analysis of the basic stochastic volatility model: 



yt = e ht/2 e t , t = 0,...,T 
h t+ i = \i + (j)(ht - /i) + au t , t = 0,...,T-l, 



(7) 



where (et) and (ut) are two uncorrelated sequences of i.i.d. standard normal random 
variables. We assume that ho ~ N (ji, jz^zj and \(f>\ < 1 to assure the stationarity of 
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the process (h t ). We observe (y t ) but not (ht), the so-called volatility process. The 
objective is to estimate 9 = (a, (ft, (3) where (3 = e^ 2 . This model and its generaliza- 
tions have attracted attention in the financial econometrics literature as a better way 



to model financial markets series. A bayesian approach to ana 



yze this model has 



1998) and the references 



been proposed by a number of authors (see e.g. (jKim et al. 
therein). The difficulty is that the volatility process (h t ) is not observed making the 
likelihood of 9 analytically intractable. The natural solution is to see (h t ) as a pa- 
rameter and to design a Gibbs sampler on the posterior distribution ti(9, h , . . . , h T ), 
of the parameter 9 and the volatility process (h , . . . , /iy). But, due to the high au- 
tocorrelation in the volatility process, this sampler mixes very slowly. This mixing 
problem has motivated some authors to propose more sophisticated reparametrization 
of the model for better MCMC convergence. We show here that by resampling from 
the past in the Gibb s sampler, we can match the performances of the sophisticated 



solution proposed in (jKim et al 



1998). 



1998) and essentially 



We use the same prior distribution for 9 as in (|Kim et al 
the same Gibbs sampler to sample from n(9, h , . . . , h T ) except when sampling from 
the conditional 7i(h t \9,h-t). To sample from this conditional, we use an Indepen 



dent Metropolis sampler instead of the Accept-Reject method adopted in (jKim et al 



19981 ) . The proposal distribution of our Independent Metropolis s ampler is t 



ne same 



as the dominating dist ribution in the A ccept-Reject sampler of (|Kim et al 



We refer the reader to 



Following (jKim et al 



K m et 



al 



1998) and 



1998). 



1998 ) for the details. 



Shephard and Pitt 



19971 ) . we use model ((Zj) to 



analyze the Sterling dataset, which gives the daily observations of weekday close ex- 
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change rates for the UK Sterling/US Dollar exchange rate from 1/10/81 to 28/6/85. 
The total number of observations is T = 946. We first center the series with the 
formula 



yt = ioo 



where (r t ) is the ob- 



(log(rt) - log(rvi)) - - E"=i( lo g( r j) - 
served exchange rates. We then model (y t ) with the model 0. 

We compare the plain Gibbs sampler with the 2 strategies discussed above: a 
Gibbs sampler with resampling from the past and a Gibbs sampler with resampling 
from an auxiliary process. To assure that the three sampler have about the same 
computational cost (storage requirement aside), we set the auxiliary process to be 
another copy of the plain Gibbs sampler with the same target distribution. The 
three samplers are run for N = 250, 000 iterations. For each sampler and for each 
of the variables a, 0, (3, we give a plot of the last 5, 000 sample points together with 
the histogram and the autocorrelation function from the last 100, 000 points. When 
resampling from the past, the resampling schedule used is B + \k] a , B = 125,000 
and a = 1.25. For the third sampler with resampling from an auxiliary process, each 
of the two chains is run for 125, 000 iterations. The results of the variable a (resp. 
and (3) are given in in Graph 2 (resp. Graph 3 and Graph 4). On each graphics, 
the first column gives the result of the plain Gibbs sampler, the second column gives 
the results of the Gibbs sampler with resampling from the past and the results of the 
third sampler are in the third column. 

Clearly, resampling from the past signi ficantly improve o n the Gibbs sampler. To 



quantify the gain, we compute, following (|Kim et al 



1998) the inefficiency of each 



sampler on each of the three variables. For a Markov chain with transition kernel P 
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a 



4> P 



Plain Gibbs 448.12 211.55 1.54 

Gibbs with resampling 10.96 4.92 0.97 
Gibbs with Aux. Proc. 12.24 9.91 1.39 



Table 1: Inefficiencies of the samplers for the Sterling dataset. 
and invariant distribution tt, the inefficiency at / is: 

00 

/(/) = 1 + 2X>(/), 



(9) 



k=l 



where p k (f) = Cov^ {f(X k ), f(X )) /Var n (f(X )) = n (fP k f) /it (f) . Basically, it 
is the cos t of using a depen dent process to sample from tt. To estimate /(/), we use, 



following 



Kim et al 



2B 



B 



1=1 



(10) 



where Pi(f) is the usual estimate of the autocorrelation at lag i for / and K the 
so-called Parzen kernel. We use B = 5,000. The result is given in Table 1. 

By resampling from the past or from an auxiliary process, we obtain a sampler 



'forms ( 


Shenhard 


and Pitt 


jxim et al.. 


1998 


)• 



1997V ) and is as efficient as the offset mixture 



4.2 Bayesian phylogeny reconstruction 

Since Darwin's theory of evolution, methods to reconstruct the evolutionary rela- 
tionships between different species have become important. We are concerned here 
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with the statistical inference of phylogenetic trees based on molecular sequences. Re- 
cently, more realistic models have been considered in this field owing to the MCMC 
machinery. We show here that MCMC samplers for phylogeny reconstruction can be 
improved upon with resampling from the past. 

The statistical model is not standard, so we summarize it fi rst. For more details 



on phylogenetic trees, we refer the reader to (jFelsensteinl . 120041 ) . Suppose we have n 
aligned deoxyribonucleic acid (DNA) sequences (yi, ■ ■ ■ ,y n ) each of length m, where 
sequence i is from organism i. That is, = (yi(l), • • • , yiijn)) where yi(J) can be one 
of the four nucleotide basis A (Adenine), G (Guanine), C (Cytosine) or T (Thymine). 
Based on these sequences, we would like to infere the phylogenetic tree or evolutionary 
relationships between these organisms. To be precise, we recall that a binary tree r 
for n species is a connected graph (V, E) with vertex set V and edges E, with no 
cycle, such that V = {p} UXUT, where p (the root) has degre 2; any v G I has degre 
3 and any v G T has degre 1. X has n — 2 elements called the internal nodes and 
T (the leaves or the tips) represent the n species. A phylogenetic tree for n species 
is a couple ip = (r, 6), where r is a binary tree for the n species and b € (0,ooy E \, 
where \E\ = 2n — 1 is the cardinality of E. For e G E, b e represents the length of 
edge e, the so-called branch length. We restrict our attention to phylogenetic trees 
with "contemporary tips", where the sum of the branch length b e on the directed path 
from the root to any tip is constant (equal to 1 hereafter). Such phylogenetic trees 
are said to be with a "molecular clock" as the b e can now be interpreted as time. Let 
^ be the set of all phylogenetic trees for n species. For % G V \ {p}, denote p(i) the 
parent of i, that is the vertex p(i) such that (p(i),i) G E. 
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The model of phylogenetic reconstruction we are interested in assumes that there 
are some missing DNA sequences (Vj){je{p}uZ} such that the joint conditional distri- 
bution of (yj)v given the phylogenetic tree ip writes: 

f((yi){iev}\i>) = f(y P ) n / {yi\y P (i),i>) ■ (n) 

iev\{ P } 

In addition we make the simplifying assumption that each site evolves indepen- 
dently: 

m 

f{Vp) = X[f{y P {j))i and (12) 

rn 

f (vi\y P {i)^) = Y[f (yi(j)\y P (i)(j),hp(i),i)) ■ ( 13 ) 

3=1 

And finally, we assume that there exist (ffi)ie{A,G,c,T}, ni > 0, Yl n i = 1) parame- 
ters 9, k G (0, oo) and a 4 x 4 Markov process generator Q = Q(9, k, tt a , tt g , n c , vr r ) 
such that: 

f(y P (j) = = I e {A, G, C, T} and (14) 
/ (ViU) = fn\y P {i){j) = = b ) = exp{bQ) lm , l,m G {A,G,C,T}. (15) 

Th e matrix Q specifies the model of DNA evolution. We use the F84 model 



as in (lLarget and Simon 



19991 ). The parameters of the statistical model are then 
(?/>, 6, k, it a, ttg-, ftc'i kt)- To simplify the sampler, we fix -k a^g^c^t to their em- 
pirical values in the data. We assume that ijj has a uniform prior distribution on 
^ and we assume that 9 and k each has a uniform prior on (0,M), M = 200. Let 
7i (ip, 9, k\ (y)i £ r) be the posterior distribution of the model. Clearly, n (ip, 9, k\ {y)^) oc 
f {{y)i£r\ip,9, k) and this likelihood is obtained by integrating out the missing vari- 
ables {yi)ie{p}ux from (fTll . A fast computation of this likelihood is available with 
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the pruning method of Felsenstein 



distribution, we follow essentially (jLarget and Simon . 



Y. F. Atchade 



Felsenstein 



2004) 



. To s ample from this posterior 



19991 ). We update 9 and k to- 



gether, given the phylogenetic tree 0, using a random walk Metropolis move. Next, 
given 9, k, we update the phyloge netic tree ip with the global move with a molecular 



clock of (jLarget and Simon . 



1999). 



We compare this plain MCMC sampler with the samplers obtained with the two 
methods disc ussed in this paper . For the simulations, we use the primate dataset 



discussed in (jYang and Rannala 



19971 ). The dataset has n = 9 species and the 



phylogeny reconstruction is based on aligned sequences of length m = 888. The three 
samplers are simulated for N = 500, 000 iterations. For each sampler and for each of 
the variables 9, k, we give a plot of the last 5, 000 sample points together with the 
histogram and the autocorrelation function from the last 150, 000 iterations. When 
resampling from the past, the resampling schedule used is B+ \k~\ a , B = 100, 000 and 
a = 1.3. For the third sampler with resampling from an auxiliary process, each of the 
two chains is run for 250, 000 iterations. The auxiliary process is a MCMC chain with 
stationary distribution 7r(°) = n 1 ^ ', with T = 2. The results of the variable 9 (resp. 
k) are given in in Graph 5 (resp. Graph 6). On each graphics, the first column gives 
the result of the plain MCMC sampler, the second column gives the results of the 



MCMC sampler with resampling from the pa st and the results o 



' the third sampler 



19991 ). the outputs 



are in the third column. In accordance with (jLarget and SimonL 
of the three samplers overwhelmingly (with an estimated posterior distribution over 
0.95) select the phylogenetic tree topology plotted in figure 7 as the most probable 



for this primate dataset. 
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e k 

Plain MCMC 1510.23 1271.87 

MCMC with resampling 13.95 24.37 
MCMC with Aux. Proc. 9.18 8.15 



Table 2: Inefficiencies of the samplers for the primates dataset 

Here again, resampling from the past significantly improve on the plain MCMC 
sampler. Table 2 gives the efficiency gains. 



5 Proofs of Theorem 12.11 and I.S.I I 

We start with Theorem 12.11 Without any loss of generality we assume that B, the 
burn-in period is 0. 

5.1 Proof of Theorem 12.11 

The following lemma is a consequence of (A). 

Lemma 5.1. Assume (A). There exists a constant C\ G (0, oo) such that for any 
signed measure ji on (X, B) such that n(X) = and for any n > 0, 

\\^P n \\ v <C lP n M v . (16) 

Proof of Theorem \2.1\ Fix n such that < n < a k+ i, k > 2. For / 6 L v such that 
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\f\ v < 1, define / = / - tt(/). We have: 

E(f{X n )\X = x) = E[E(f(X n )\X ak )\X = x] 
= E{P n ^f(X ak )\X = x) 
= (£ (afe) -tt) [P n - a »J] (x), 

where £^(x,A) = Pr (X afc £ A\X = x). Therefore, since 

|||£( n ) — 7r||| y = sup x . gA . sup ^^v<A E (A Xn ^ x ° x )\ ^ j t follows from Lemma EH that: 



(17) 
(18) 



7T 



< dp" ll£ K) - ttIIL,. 



(19) 



Also, for / £ Ly with |/L < 1, we have: 



C {ak) f(x) = E 



a k -l 



Qfc-l 
a k 



-1 vv K 

+ - J] E(/(X i )|X = x) 



l0 = x 

a k -\ 

a k 



(20) 



3= a k-l 



Then proceding as above and using Lemma iBTil again we get: 



|£ K) - ttIII < exp (-u k ) \\\£ {ak - l] - ir\ 



v 



(21) 



with u k = log(afe) — log(afc_i + c), c = j^. If we define u\ = — a x log(p) and 
5k = Yli=i u ki we get |||£^ — ?r HI < C 2 exp(— 5k) for some finite constant C 2 , which, 
together with fTHl) yields: 



|£M-7r||L = 0(p"- afe exp(-5 fc )) 



(22) 



for a k <n < a k+ i, as wanted. 



□ 



Resampling from the past 

5.2 Proof of Theorem EO 
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Let {X n } be the process generated by the importance-resa mpling scheme. We prove 



Theo rem 13.11 as a consequence of Theorems 3.1 and 3.2 of (jAtchade and Rosenthal 



20051 ). Denote T n the cx-algebra generated by (X , . . . ,X n ). For x G X and A G B, 



define P n (x, A) = Pr (X n G A|X n _i = x) = Pr (X n G A\T n -i, X n _i = x). We have: 



where = E 



P n (x, A) = 9P(x, A) + (1 - 0)// n (A), 



E£oM*j 0) ) 



(23) 



Define M r = sup n E (V r (X^ 0) )) , r > 0. It follows from (AO) that M r < M ± < oo 
for all r G [0, 1]. For p > 0, we write = ^(xf 0) ), s n = ElZl^k, V? = V a (xf ] ) 



and /ii p) = E 



The next lemma is crutial. 



Lemma 5.2. For p G [1,4-], max <i<„_i E 



O (i) and = 0(1) as 



n — > 00. 



Proof. By the Minkowski inequality, we only need to prove that max <j< n _i E 

Write c = A(/i) and Co = A(/i^). For < i < n — 1 and k G (0, c/cq), we have: 



E 



E 



L{s„>n(c/co— k)} 



+ E 



-{s„<n(c/co— k)} 



E 



-{s n >n(c/c -K)} 



< 



nP(c/c - «)f 



E[uiV«] p < 



\u\y a Ml 

nP(c/c - k)p' 
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By the Cauchy-Schwarz inequality, we can bound the second term as follows: 



E 



L{s„<n(c/co— k)} 



2p 



Pr 



< E 1 / 2 [V? pa ] Pr 



< 2 ^1/2 

- 1 , ; E 1/2 2^ 



if 



UJi I < —K 

Co 



1/2 



i=0 



UJi I < —K 

c 



1/2 



n 2p K 4p 



,i=0 



UJi 

c 



Ip 



where for the last line, the Markov inequality was used. Now we use the classi- 
cal Poisson equation and martingale approximation technique. Since uj < V a , the 
Poisson equation uj — c/c = g — P^g has a solution g which satisfies \g\ < V a . 
With this solution, for n > 1, we can rewrite Y^=o w » ~ c / c o — M n + W n where 
W n = g(X^) - PWgixMj, M n = YT'l 9{xf ] ) - P^g{xf\) and (M n ) is a mar- 
tingale. Therefore with the Minkowski inequality, we get: E 1 / 2 (Yl^oi^i ~ c /c )) 4p < 
{M n ) Ap + ¥}I a p [W n ) Ap ] 2p . Since \g\ ya < oo and 4pa < 1, it follows from As- 
sumption (AO) that sup^-E (g(Xp) - P^aixf)) ^ < oo. Therefo r e (E 1 / ^ (W n f p ) 



is bounded. Using Burkholder's inequality (see e.g. (jHall and Hevde 



1MJ)), we have 



the bound: 



'n-l 



E(M n ) Ap < K 3 E[J2(9(X^)-P^g(X^) 



i=l 
n-l 



2p 



i=l 



2p 



for some finite constants -K" 3 , l^Q. This implies that E 1 / 2 (^^(^ ~~ c /c )) 4p = 0( 
which finishes the proof. 



□ 
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Lemma 5.3. For all n>l,P n has an invariant distribution ir n , and for all k>0, 



\p k -7r n \\\<ce k P k , 



where the constant C G (0, oo) does not depend on n or k. Moreover 



7T„(/) — ► 7r(/), as n -> oo, 



for any measurable function f, with \ f\ Va < oo. 



(24) 



(25) 



Proof. One can directly check that the invariant distribution of P n is n n where: 



\t=0 , 

And by recurrence, we can check that for k > and g G Lye*: 



(26) 



^0 - «n(g) = k P k g - (1 - #K [y^ FPtg] . (27) 
Therefore |||P^ — 7Tn||y a < # fc p fc + su Pn A t n(^ /Aa )j and according to Lemma 

sup n /i n (V a ) is finite. 

For / G Lya , we write C(/) = (1-0) E,=o ^7 ^ L v «. We have |7r„(/) - tt(/)| = 
K (C(/))|, where / = /- tt(/). Note that tt(C(/)) = 0. We recall: 



/in (/) = E 



E£>(4 0) )/(4 0) : 

EK^ 0) ) 



(28) 



From the strong law of large numbers for {X^}, the expression under the expectation 
in (|28|) converges a.s. to as n — >• oo. On the other hand, for p G (1, l/4a), 



E 



eS^(^ 0) )/(4 0) ; 



(29) 
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,(?h ic Q hr^nnrW oormonno Tliorpfnrp eormonno I ^£=0 ^i". ) 



and (/in ) is a bounded sequence. Therefore the sequence I — fc ~° n _ 1 fc — — is 
uniformly integrable and it follows that fJL n (f) as n — > oo. □ 



Lemma 5.4. 



|P„ - Pn-llya + IKn - TTn-lHv" = O ( - ) . (30) 



Proof. For n > 1, we have: |||P n — P n -i|ya + |Fn — 7Tn-i||y a < 2(1 — 9)E 



En — 1 

and the lemma follows from Lemma f5. 21 □ 



Proof of Theorem\3.1\ Follow s from Lemmas 15.31 and 15.41 and Theorems 3.1, 3.2 of 
Atchade and Rosenthal! 1)20051 ). □ 
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Graph 2: Outputs for a. Sterling dataset. First column is the plain Gibbs, second 
column is resampling from the past; last column: resampling from an auxiliry Gibbs 

sampler. 
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Graph 3: Outputs for 0. Sterling dataset. First column is the plain Gibbs, second 
column is resampling from the past; last column: resampling from an auxiliry Gibbs 

sampler. 
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Graph 4: Outputs for (3. Sterling dataset. First column is the plain Gibbs, second 
column is resampling from the past; last column: resampling from an auxiliary 

Gibbs sampler. 
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Graph 5: Outputs for 9. Primates dataset. First column is the plain MCMC, 
second column is resampling from the past; last column: resampling from an 

auxiliary MCMC sampler. 
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Graph 6: Outputs for k. Primates dataset. First column is the plain MCMC, 
second column is resampling from the past; last column: resampling from an 

auxiliary MCMC sampler. 
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Graph 7: The most probable phylogenetic tree topology in the primates dataset. 



